Amendments to the Specification:

This amendment to the specification will replace all prior versions of the present application. In 
reading this, text added by the amendment is underlined and text that is deleted is shown in 
[[double brackets]]. 

Please amend the paragraph beginning on page 9, line 26 as shown: 

Figure 2 illustrates a block diagram of the video module 20 illustrated in Figure 1. The
video module 20 includes a processing element (PE) array 100, a block load/store unit 200, a 
global accumulation unit 300, a local CPU 400 and an instruction and data memory 500. In the 
first embodiment, the local general-purpose CPU 400 is a 32-bit MIPS CPU and includes 32
scalar registers, and the instruction and data memory 500 is an 8 KB memory. All other units of 
the video module 20, such as the PE array [[400]] 100, the block load/store unit 200, and the 
global accumulation unit 300, are implemented as a video co-processor to the local 32-bit MIPS 
CPU and connected to the latter through the standard MIPS co-processor interface. The block 
load/store unit 200 of each video module 20 is connected to the on-chip shared memory 30 
(Figure 1) via a direct high-bandwidth data path. Alternatively, one high-bandwidth bus is 
shared by all video modules 20. In the first embodiment, the PE array 100 is a two-dimensional 
SIMD (single-instruction multiple-data) 4 x 4 PE array, including 16 video processing elements 
(PEs). Each processing element within the PE array 100 is described in detail below in reference 
to Figure 3. The block load/store unit 200 is described in detail below in reference to Figure 4.
The global accumulation unit 300 is described in detail below in reference to Figure 5. Each 
video module 20 has a parallel heterogeneous architecture extending a conventional RISC
(reduced instruction set computer) architecture with support for video processing in the form of 
the two-dimensional SIMD 4 x 4 PE array 100, the block load/store unit 200, and the global 
accumulation unit 300. The 4 x 4 PE array 100 is configured according to 4 vertical slices, each 
vertical slice including 4 processing elements. As shown in Figure 2, a first vertical slice 
includes PEs 0-3, a second vertical slice includes PEs 4-7, a third vertical slice includes PEs
8-11, and a fourth vertical slice includes PEs 12-15. All PEs in each slice share their own set of
buses, such as a 32-bit instruction bus, a 16-bit data read bus, a 16-bit data write bus, a 1-bit PE 
mask read bus, and a 1-bit PE mask write bus, with each of the buses having its own set of 
control signals. 
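
The slice numbering described above can be illustrated with a short sketch. It is not part of the disclosed hardware; it only restates the index arithmetic implied by the text (PEs 0-3 in the first vertical slice, PEs 4-7 in the second, and so on). The names slice_of and pes_in_slice, and the use of 0-based slice numbers, are assumptions made for illustration.

```python
from typing import List

# Illustrative sketch only: index arithmetic for the 4 x 4 PE array.
# Function names and 0-based slice numbering are hypothetical.

SLICES = 4          # vertical slices in the PE array
PES_PER_SLICE = 4   # processing elements per vertical slice


def slice_of(pe: int) -> int:
    """Return the 0-based vertical slice containing PE number `pe` (0-15)."""
    if not 0 <= pe < SLICES * PES_PER_SLICE:
        raise ValueError(f"PE index out of range: {pe}")
    return pe // PES_PER_SLICE


def pes_in_slice(s: int) -> List[int]:
    """Return the PE numbers that share the buses of vertical slice `s` (0-3)."""
    if not 0 <= s < SLICES:
        raise ValueError(f"slice index out of range: {s}")
    start = s * PES_PER_SLICE
    return list(range(start, start + PES_PER_SLICE))
```

For example, under these assumptions pes_in_slice(2) lists the PEs of the third vertical slice, which the text gives as PEs 8-11.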